library(stats)
library(dplyr)
library(ggplot2)
library(devtools)
library(gridExtra)
library(grid)
library(tidyverse)
library(knitr)
library(lubridate)
library(reshape2)
library(vcd)
library(maps)
library(resample)
library(rjags)
library(kableExtra)

1 Executive Summary

We use historical data of earthquakes and estimate probabilities of large earthquakes occurring in eight different countries in year 2022. The Bayesian framework for inference is used and Poisson processes are assumed for temporal occurrences of earthquakes. Posterior distributions for the expected mean of rate for the different countries are obtained, however, shortcomings of the current approach are also identified. These include: 1. the data do not strictly follow Poisson process and 2. there is a presence of possible correlation between the observations.

2 Introduction

In some parts of the world, earthquakes pose great natural threats to humans which cause immense damage and affect many people’s lives. Despite its relevance, predicting occurrences of major earthquakes remains a big challenge. In this report, we present an attempt to estimate the probability of large earthquakes occurring in a given country within the year 2022 using the Bayesian framework of inference.

3 Data

Our goal is to make statistical inference for the probabilities of earthquake occurring in the future. To this end, we use the historical data of earthquakes, which we collect from a worldwide earthquake catalog from the United States Geological Survey database (https://earthquake.usgs.gov/earthquakes/search/). We analyze the earthquakes that occurred after 1970 with the magnitude greater than 7.0. We restricted our data set to these conditions because the records prior to 1970 and below the magnitude 6.0 could be prone to errors and incompleteness (see the website for details).

earthquakes <- read.csv('earthquakes_processed.csv', header = TRUE)
earthquakes$time <- as.Date(substr(earthquakes$time,1,10), format('%Y-%m-%d'))
earthquakes$year <- year(earthquakes$time)
print(head(earthquakes), row.names=FALSE)
##        time latitude longitude mag     country year
##  2021-11-28  -4.4528  -76.8109 7.5        Peru 2021
##  2021-10-02 -21.1265  174.8958 7.3     Vanuatu 2021
##  2021-09-08  16.9465  -99.7530 7.0      Mexico 2021
##  2021-08-14  18.4335  -73.4822 7.2       Haiti 2021
##  2021-08-11   6.4748  126.7151 7.1 Philippines 2021
##  2021-07-29  55.3635 -157.8876 8.2         USA 2021
print(tail(earthquakes), row.names=FALSE)
##        time latitude longitude mag     country year
##  1970-05-27   27.236   140.230 7.1       Japan 1970
##  1970-04-29   14.520   -92.653 7.3      Mexico 1970
##  1970-04-07   15.791   121.630 7.4 Philippines 1970
##  1970-02-28   52.487  -174.915 7.1         USA 1970
##  1970-01-10    6.785   126.682 7.2 Philippines 1970
##  1970-01-04   24.185   102.543 7.1       China 1970

The earthquakes in the data set are visualized on the world map below (Figure 3.1). Note that earthquakes that occurred far out into the sea are omitted from the data set.

world_map <- map_data("world")
p <- ggplot() + coord_fixed() + xlab("") + ylab("")
base_world_messy <- p + geom_polygon(data=world_map, aes(x=long, y=lat, group=group), colour="light green", fill="light green")
cleanup <- theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(), 
        panel.background = element_rect(fill = 'white', colour = 'white'), 
        axis.line = element_line(colour = "white"), legend.position="none",
        axis.ticks=element_blank(), axis.text.x=element_blank(),
        axis.text.y=element_blank())
base_world <- base_world_messy + cleanup
map_data <- base_world + geom_point(data=earthquakes, aes(x=longitude, y=latitude, size=mag), colour="Deep Pink", fill="Pink", pch=21, alpha=I(0.7)) 
print(map_data)
World map with major earthquakes since 1970.

Figure 3.1: World map with major earthquakes since 1970.

agg <- earthquakes %>% 
  group_by(year, country) %>% 
  summarise(count=n(), .groups = 'drop') %>% 
  pivot_wider(names_from = country, values_from = count, values_fill = 0)
names(agg) <- make.names(names(agg), unique=TRUE)

We limit the number of countries for the analysis to eight. The choice of this number is completely arbitrary and this is solely for brevity’s sake. We take the top countries with the largest number of earthquakes into our analysis. The sample data shown below holds the number of earthquakes by year per country (see Table 3.1). The same data are shown in a time series format in Figure 3.2.

# Select countries with total earthquakes >= 30 (this filter is arbitrary)
cols <- which(colSums(agg) >= 30)
agg <- agg %>% select(cols)
kable(tail(agg,3), caption="The earthquake dataset.") %>% kable_styling(html_font=8)
Table 3.1: The earthquake dataset.
year Japan Philippines Solomon.Islands USA Guinea Indonesia Russia Vanuatu
2019 0 0 0 0 2 2 0 0
2020 0 0 0 2 1 0 2 0
2021 2 2 0 1 0 0 0 1
print(summary(agg[,-1]))
##      Japan         Philippines     Solomon.Islands       USA        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.0000   Median :1.0000   Median :0.0000   Median :1.0000  
##  Mean   :0.9231   Mean   :0.7692   Mean   :0.5962   Mean   :0.7115  
##  3rd Qu.:2.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :6.0000   Max.   :3.0000   Max.   :4.0000   Max.   :2.0000  
##      Guinea        Indonesia         Russia          Vanuatu      
##  Min.   :0.000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.000   Median :1.000   Median :0.0000   Median :0.0000  
##  Mean   :1.077   Mean   :1.423   Mean   :0.6731   Mean   :0.8077  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :5.000   Max.   :5.000   Max.   :4.0000   Max.   :3.0000
agg_melt <- melt(agg, id.vars = 'year', variable.name = 'country')
print(ggplot(agg_melt, aes(year, value)) + geom_line(aes(colour = country)))
Time series of earthquake occurances.

Figure 3.2: Time series of earthquake occurances.

We plot the auto-correlations of the time series (see Figure 3.3) to evaluate the that the independence of the earthquakes. We acknowledge the level of auto-correlation in the plot for Guinean and other countries. These are interesting observations and deserve further investigations, but it is for now out of scope for this assignment and, therefore, we proceed with our analysis.

p <- c()
i <- 1
for(column in colnames(agg))
{
  if (column != "year")
  {
    bacf <- acf(agg[[column]], plot=FALSE)
    bacfdf <- with(bacf, data.frame(lag, acf))
    p[[i]] <- ggplot(data=bacfdf, mapping=aes(x=lag, y=acf)) + 
      geom_segment(mapping=aes(xend=lag, yend=0)) + 
      geom_hline(aes(yintercept = 0.2), linetype = 3, color = 'darkblue') + 
      geom_hline(aes(yintercept = -0.2), linetype = 3, color = 'darkblue') +
      ggtitle(column) +
      theme(plot.title = element_text(hjust = 0.5))
    i = i + 1
  }
}

grid.arrange(
  p[[1]], p[[2]], p[[3]], p[[4]], p[[5]], p[[6]], p[[7]], p[[8]],
  nrow=2,
  bottom = textGrob(
    "",
    gp = gpar(fontface=3, fontsize=9),
    hjust=1,
    x=1
  )
)
Autocorrelation of the timeseries.

Figure 3.3: Autocorrelation of the timeseries.

We plot the distribution of the earthquakes by country in Figure 3.4. All are positively skewed with median lying on either 0 or 1 and mean taking value between 0.60 and 1.07.

h <- c()
i <- 1
for(column in colnames(agg))
{
  if (column != "year")
  {
    h[[i]] <- ggplot(agg, aes_string(column)) + geom_histogram(binwidth=1, fill='blue', alpha=0.5)
    i = i + 1
  }
}

grid.arrange(
  h[[1]], h[[2]], h[[3]], h[[4]], h[[5]], h[[6]], h[[7]], h[[8]],
  nrow=2,
  bottom = textGrob(
    "",
    gp = gpar(fontface=3, fontsize=9),
    hjust=1,
    x=1
  )
)
Histogram: number of earthquakes in a year

Figure 3.4: Histogram: number of earthquakes in a year

Finally, we run a goodness of fit test for each distribution using the Chi-squared statistics to check whether the data are not obviously inconsistent with the underlying distribution being Poisson (see Table 3.2). If the p value of the test is larger than 0.05, we can support the null hypothesis which states that the process is a Poisson process. Again, we acknowledge that for some countries, we fail to support the null hypothesis. There are other distributions that are known to fit the earthquakes temporal distribution better [Min-Hao Wua: Earthquake, Poisson and Weibull distributions], however, we carry on with the inference task assuming a Poisson process for each distribution.

pvalues <- c()
i=1
for(column in colnames(agg))
{
  if (column != "year")
  {
    gf = goodfit(agg[[column]], type="poisson", method="ML")
    gf.summary = capture.output(summary(gf))[[5]]
    pvalue = unlist(strsplit(gf.summary, split = " "))
    pvalue = as.numeric(pvalue[length(pvalue)])
    pvalues[[i]] <- c(column, pvalue)
    i = i + 1
  }
}
pvalues <- as.data.frame(do.call(rbind, pvalues))
colnames(pvalues) <- c("country", "p-value")
kable(pvalues, caption='pvalues of Chi-square tests') %>% kable_styling(html_font=8)
Table 3.2: pvalues of Chi-square tests
country p-value
Japan 0.02159118
Philippines 0.3905217
Solomon.Islands 0.007532658
USA 0.02987068
Guinea 0.4667118
Indonesia 0.04640559
Russia 0.6198651
Vanuatu 0.03627001

4 Model

We assume that the temporal occurrence of earthquakes with magnitude greater than 7.0 is a Poisson process: i.e. independent, stationary and do not occur simultaneously. Therefore, we employ Poisson likelihood with a hierarchical structure, where the hierarchical grouping is done by country. Each country sits on a different junction tectonic plates and hence the earthquakes occurring in the same country should share the same distribution parameter (lambda), but across different countries, it is natural to assume that these lambdas are different. We specify Gamma prior on the intra-country lambdas (\(\lambda\)[1],…,\(\lambda\)[8]), whose mean (\(\mu\)) comes from yet another Gamma distribution with the mean at the empirical mean of the observation and standard deviation from an exponential distribution. Note that \(\mu\) is our prior on the inter-country mean of the rate as well.

stats <- merge(x=as.data.frame(colMeans(agg)), y=as.data.frame(colVars(agg)), by=0) %>% filter(Row.names != 'year')
colnames(stats) <- c("country","mean", "variance")
print(paste0("Inter-country empirical mean: ", mean(stats$mean)))
print(paste0("Inter-country empirical variance: ", var(stats$mean)))

We fit the model using JAGS and R and generate three chains of simulation, but throw away the first 1000 steps as burn-in. We then produce 5000 more steps for each chain.

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 416
##    Unobserved stochastic nodes: 10
##    Total graph size: 848
## 
## Initializing model

The trace plots of the parameters and Gelman-Rubin diagnostics indicate the convergence of the simulations (i.e. potential scale reduction factors all close to 1), and the effective sample size of all parameters are on the order of thousands, which guarantees us a reliable estimation of credible intervals.

# convergence diagnostics
plot(mod_sim)

gelman.diag(mod_sim)
## Potential scale reduction factors:
## 
##        Point est. Upper C.I.
## lam[1]       1.00       1.00
## lam[2]       1.00       1.00
## lam[3]       1.00       1.00
## lam[4]       1.00       1.00
## lam[5]       1.00       1.00
## lam[6]       1.00       1.01
## lam[7]       1.00       1.00
## lam[8]       1.00       1.00
## mu           1.01       1.01
## sig          1.01       1.03
## 
## Multivariate psrf
## 
## 1
autocorr.diag(mod_csim)
##              lam[1]        lam[2]      lam[3]      lam[4]      lam[5]
## Lag 0   1.000000000  1.0000000000 1.000000000 1.000000000 1.000000000
## Lag 1   0.052191263  0.0563018452 0.150071221 0.063227299 0.105055018
## Lag 5   0.026208785  0.0072873467 0.028144419 0.005636768 0.021745319
## Lag 10  0.012393523  0.0140069350 0.023733344 0.002579963 0.002769695
## Lag 50 -0.003217542 -0.0001217229 0.006367991 0.006053059 0.003041239
##             lam[6]       lam[7]       lam[8]          mu         sig
## Lag 0  1.000000000  1.000000000  1.000000000  1.00000000  1.00000000
## Lag 1  0.257889508  0.090720399  0.044974645  0.56876611  0.79164505
## Lag 5  0.077897242 -0.002228956 -0.001033791  0.14763006  0.36030523
## Lag 10 0.027960982  0.008373325 -0.004228180  0.04570664  0.13050156
## Lag 50 0.004550169  0.010604624 -0.015853268 -0.01452142 -0.01772157
autocorr.plot(mod_csim)

effectiveSize(mod_csim)
##    lam[1]    lam[2]    lam[3]    lam[4]    lam[5]    lam[6]    lam[7]    lam[8] 
## 12007.231 12841.465  7903.343 11892.701  8986.526  4948.248 10954.022 12734.332 
##        mu       sig 
##  3147.077  1539.804

We check the fit via residuals. With a hierarchical model, there are two levels of residuals: the observation level and the country mean level. To simplify, we look at the residuals associated with the posterior means of the parameters.

The observation residuals, based on the estimates of country means, seems to be right skewed indicating that the model struggles to fit to years when an unexpectedly large number of earthquakes occurred. For example, in 2011 in Japan, there were in total 6 earthquakes with magnitude greater 7. These earthquakes were thought to had been correlated (i.e. triggered by the first earthquake with magnitude 9 in the Tohoku area). Our model obviously fails to capture this accurately since the independence of events are assumed. The country mean level residual on the other hand look fine. Note that we omitted the plots for the limitation of pages but these were produced in the source code written in R.

pm_params = colMeans(mod_csim)
yhat = rep(pm_params[1:8], each=52)

# Observation level residuals
resid = agg_melt$value - yhat
plot(resid)

plot(jitter(yhat), resid)

# Country level residuals
lam_resid = pm_params[1:8] - pm_params["mu"]
print(lam_resid)
##      lam[1]      lam[2]      lam[3]      lam[4]      lam[5]      lam[6] 
##  0.01903169 -0.09665697 -0.22630709 -0.13978788  0.13235206  0.39126484 
##      lam[7]      lam[8] 
## -0.16969524 -0.06724720
plot(lam_resid)
abline(h=0, lty=2)

5 Results

We present the posterior summary in Table 5.1. The means of the parameter lambda in Poisson posterior distribution is the expected rate of occurrence. For example, \(\lambda\)[1] is the expected mean rate for Japan and the model states that with probability 0.5981315 [1-ppois(0,0.9116303))], there will be at least one earthquake with a magnitude greater than 7.0 occurring in Japan in 2022. The same statement could be made for other countries by using the posterior distribution of their lambdas. We acknowledge that for some countries, the data are not statistically consistent with the Poisson process assumption and also from inspecting the residual plot of the fit, it is clear that the models are not always accurate.

kable(as.data.frame((summary(mod_sim)$statistics)), caption='Posterior distribution of the parameters.') %>% kable_styling(html_font=8)
Table 5.1: Posterior distribution of the parameters.
Mean SD Naive SE Time-series SE
lam[1] 0.9125559 0.1166397 0.0009524 0.0010670
lam[2] 0.7968672 0.1102064 0.0008998 0.0009675
lam[3] 0.6672171 0.1060165 0.0008656 0.0011598
lam[4] 0.7537363 0.1077143 0.0008795 0.0009982
lam[5] 1.0258762 0.1290802 0.0010539 0.0013303
lam[6] 1.2847890 0.1579988 0.0012901 0.0022152
lam[7] 0.7238289 0.1073898 0.0008768 0.0010186
lam[8] 0.8262770 0.1115394 0.0009107 0.0009841
mu 0.8935242 0.1154030 0.0009423 0.0021109
sig 0.2649623 0.1247689 0.0010187 0.0032011

6 Conclusions

In this report, we estimated the probability of large earthquakes occurring in a given country within the year 2022. We have assumed Poisson processes for earthquake temporal occurrence distributions and applied hierarchical models based on countries. We have obtained the posterior distributions for the expected mean of the rate for eight different countries. However, along the process, we have identified some shortcomings of the approach (i.e. the data not strictly following Poisson process, possible correlation between the observations).